AITopics | visual input

visual input

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Neural Information Processing Systems, Jun-13-2026, 22:57:12 GMT

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL
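The two ingredients named in this abstract, Gaussian perturbation of the visual input and a Bayesian fusion of a noise-based prior with the observed reward, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the prior mean `1 - noise_level`, the variances, and the group normalization are all assumptions chosen for illustration, and the fusion shown is the standard precision-weighted update for a Gaussian prior and a Gaussian likelihood.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Perturb an image with values in [0, 1] using zero-mean Gaussian noise."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def posterior_score(prior_mean, prior_var, reward, reward_var):
    """Fuse a Gaussian prior with a Gaussian observation (precision weighting)."""
    prior_prec = 1.0 / prior_var
    obs_prec = 1.0 / reward_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * reward)
    return post_mean, post_var


def group_advantages(noise_levels, rewards, prior_var=0.25, reward_var=0.25):
    """Hypothetical advantage: posterior mean, normalized within the group."""
    noise_levels = np.asarray(noise_levels, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Assumption: a trajectory sampled under less noise gets a higher prior score.
    prior_mean = 1.0 - noise_levels
    post_mean, _ = posterior_score(prior_mean, prior_var, rewards, reward_var)
    return (post_mean - post_mean.mean()) / (post_mean.std() + 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((4, 4, 3))
    noisy = add_gaussian_noise(image, sigma=0.3, rng=rng)
    print(noisy.shape)
    # Equal rewards, different noise levels: the cleaner trajectory ranks higher.
    print(group_advantages([0.0, 0.5, 1.0], [1.0, 1.0, 1.0]))
```

With equal rewards, the sketch ranks the trajectory generated under the least noise highest, which mirrors the stated goal of preferring visually grounded trajectories over noisy ones.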

artificial intelligence, machine learning, proceedings, (8 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.77)

Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

Neural Information Processing Systems, Jun-11-2026, 23:44:27 GMT

Training vision language models (VLMs) aims to align visual representations from a vision encoder with the textual representations of a pretrained large language model (LLM). However, many VLMs exhibit reduced factual recall performance compared to their LLM backbones, raising the question of how effective multimodal fine-tuning is at extending existing mechanisms within the LLM to visual inputs. We argue that factual recall based on visual inputs requires VLMs to solve a two-hop problem: (1) forming entity representations from visual inputs, and (2) recalling associated factual knowledge based on these entity representations. By benchmarking 14 VLMs with various architectures (LLaVA, Native, Cross-Attention), sizes (7B-124B parameters), and training setups on factual recall tasks against their original LLM backbone models, we find that 11 of 14 models exhibit factual recall degradation. We select three models with high and two models with low performance degradation, and use attribution patching, activation patching, and probing to show that degraded VLMs struggle to use the existing factual recall circuit of their LLM backbone, because they resolve the first hop too late in the computation. In contrast, high-performing VLMs resolve entity representations early enough to reuse the existing factual recall mechanism. Finally, we demonstrate two methods to recover performance: patching entity representations from the LLM backbone into the VLM, and prompting with chain-of-thought reasoning. Our results highlight that the speed of early entity resolution critically determines how effective VLMs are in using preexisting LLM mechanisms. More broadly, our work illustrates how mechanistic analysis can explain and unveil systematic failures in multimodal alignment.
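Activation patching, one of the analysis tools named in this abstract, can be illustrated on a toy network. The sketch below is an assumption-laden illustration and not the authors' setup: it uses a hypothetical two-layer network with a scalar readout, runs it on a "clean" and a "corrupted" input, copies one hidden unit at a time from the clean run into the corrupted run, and reports what fraction of the output gap each unit recovers.

```python
import numpy as np


def hidden_acts(x, w1):
    """Hidden activations of the toy network."""
    return np.tanh(w1 @ x)


def readout(hidden, w2):
    """Scalar linear readout from the hidden layer."""
    return float(w2 @ hidden)


def patch_units(clean_x, corrupt_x, w1, w2):
    """Fraction of the clean-vs-corrupt output gap recovered by patching each unit."""
    h_clean = hidden_acts(clean_x, w1)
    h_corrupt = hidden_acts(corrupt_x, w1)
    y_clean = readout(h_clean, w2)
    y_corrupt = readout(h_corrupt, w2)
    recovered = np.zeros(len(h_clean))
    for i in range(len(h_clean)):
        h_patched = h_corrupt.copy()
        h_patched[i] = h_clean[i]  # overwrite one unit with its clean value
        recovered[i] = (readout(h_patched, w2) - y_corrupt) / (y_clean - y_corrupt)
    return recovered


if __name__ == "__main__":
    # Hypothetical weights and inputs, chosen only to make the example concrete.
    w1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    w2 = np.array([1.0, 2.0, 3.0])
    clean = np.array([1.0, 0.0])
    corrupt = np.array([0.0, 1.0])
    print(patch_units(clean, corrupt, w1, w2))
```

Because the readout here is linear in the hidden layer, the per-unit fractions sum to one; in a deep model the effects are generally not additive, which is one reason such analyses are usually run per layer and per token position.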

artificial intelligence, large language model, natural language, (9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Calibrated Self-Rewarding Vision Language Models

Neural Information Processing Systems, Mar-20-2026, 19:44:48 GMT

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches are resource-intensive and may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable.

artificial intelligence, large language model, natural language, (8 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.59)

representations, leading to the ability to process multi-object visual inputs, but falling short of the systematic generalization displayed by human reasoning. Other

Neural Information Processing Systems, Feb-17-2026, 15:43:19 GMT

Co-first author order is arbitrary and may be swapped when citing this work.

artificial intelligence, machine learning, relation, (16 more...)

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

9766527f2b5d3e95d4a733fcfb77bd7e-Paper.pdf

Neural Information Processing Systems, Feb-10-2026, 03:42:04 GMT

We propose a method for knowledge transfer based on a regularization term in our loss function, supervising the sequence of required reasoning operations.

artificial intelligence, machine learning, natural language, (19 more...)

Country: Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)

31a19921acd38cdf7a8c86ec032cef2d-Paper-Conference.pdf

Neural Information Processing Systems, Feb-9-2026, 18:53:17 GMT

behavioral variable, neuron, publisher, (13 more...)

Country:

North America > United States > California > Santa Barbara County > Santa Barbara (0.14)
North America > United States > Oregon > Lane County > Eugene (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
Asia > India (0.04)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

36ac8e558ac7690b6f44e2cb5ef93322-AuthorFeedback.pdf

Neural Information Processing Systems, Feb-8-2026, 02:04:34 GMT

experiment, feature selection, populationwise fir, (12 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Systematic Visual Reasoning through Object-Centric Relational Abstraction

Neural Information Processing Systems, Dec-27-2025, 01:22:46 GMT

Human visual reasoning is characterized by an ability to identify abstract patterns from only a small number of examples, and to systematically generalize those patterns to novel inputs. This capacity depends in large part on our ability to represent complex visual inputs in terms of both objects and relations. Recent work in computer vision has introduced models with the capacity to extract object-centric representations, leading to the ability to process multi-object visual inputs, but falling short of the systematic generalization displayed by human reasoning. Other recent models have employed inductive biases for relational abstraction to achieve systematic generalization of learned abstract rules, but have generally assumed the presence of object-focused inputs. Here, we combine these two approaches, introducing Object-Centric Relational Abstraction (OCRA), a model that extracts explicit representations of both objects and abstract relations, and achieves strong systematic generalization in tasks (including a novel dataset, CLEVR-ART, with greater visual complexity) involving complex visual displays.

object-centric relational abstraction, systematic generalization, systematic visual reasoning, (4 more...)

Technology: Information Technology > Artificial Intelligence > Vision (0.84)

Curriculum Learning With Infant Egocentric Videos

Neural Information Processing Systems, Dec-26-2025, 13:00:36 GMT

Infants possess a remarkable ability to rapidly learn and process visual inputs. As an infant's mobility increases, so does the variety and dynamics of their visual inputs. Is this change in the properties of the visual inputs beneficial or even critical for the proper development of the visual system? To address this question, we used video recordings from infants wearing head-mounted cameras to train a variety of self-supervised learning models. Critically, we separated the infant data by age group and evaluated the importance of training with a curriculum aligned with developmental order. We found that initiating learning with the data from the youngest age group provided the strongest learning signal and led to the best learning outcomes in terms of downstream task performance. We then showed that the benefits of the data from the youngest age group are due to the slowness and simplicity of the visual experience. The results provide strong empirical evidence for the importance of the properties of the early infant experience and developmental progression in training. More broadly, our approach and findings take a noteworthy step towards reverse engineering the learning mechanisms in newborn brains using image-computable models from artificial intelligence.

curriculum learning, infant egocentric video, name change, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (0.86)
Information Technology > Artificial Intelligence > Machine Learning (0.77)
